{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "Text_Normalization_Tutorial.ipynb",
      "private_outputs": true,
      "provenance": [],
      "collapsed_sections": [
        "xXRARM8XtK_g",
        "YDVtdJxhZWju",
        "lcvT3P2lQ_GS",
        "GYylwvTX2VSF"
      ],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "id": "x0DJqotopcyb"
      },
      "source": [
        "\"\"\"\n",
        "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
        "\n",
        "Instructions for setting up Colab are as follows:\n",
        "1. Open a new Python 3 notebook.\n",
        "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
        "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
        "4. Run this cell to set up dependencies.\n",
        "\"\"\"\n",
        "# If you're using Google Colab and not running locally, run this cell\n",
        "\n",
        "# install NeMo\n",
        "BRANCH = 'r1.0.0rc1'\n",
        "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CH7yR7cSwPKr"
      },
      "source": [
        "import json\n",
        "import os\n",
        "import wget\n",
        "import numpy as np\n",
        "import inspect\n",
        "import regex as re\n",
        "import sys\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xXRARM8XtK_g"
      },
      "source": [
        "# Introduction\n",
        "Text normalization for Text to Speech (TTS) converts text into its verbalized form. It targets tokens that belong to special semiotic classes, such as numbers, times, dates, and monetary amounts, which are often written differently from the way they are spoken. For example, \"10:00\" -> \"ten o'clock\", \"10:00 a.m.\" -> \"ten a m\", \"10kg\" -> \"ten kilograms\".\n",
        "\n",
        "We use the same semiotic classes as in the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):\n",
        "PLAIN, PUNCT, DATE, CARDINAL, LETTERS, VERBATIM, MEASURE, DECIMAL, ORDINAL, DIGIT, MONEY, TELEPHONE, ELECTRONIC, FRACTION, TIME, ADDRESS. We also add the class `WHITELIST` for whitelisted tokens, whose verbalizations are looked up directly from a user-defined list.\n",
        "\n",
        "This tutorial shows how to use the NeMo rule-based text normalization system.\n",
        "Similar to [The Google Kestrel TTS text normalization system](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf), the NeMo rule-based system is divided into a tagger and a verbalizer: the tagger detects and classifies semiotic classes in the underlying text, and the verbalizer takes the output of the tagger and carries out the normalization.\n",
        "In the example 'The alarm goes off at 10:30 a.m.', the tagger for time detects `10:30 a.m.` as a valid time with `hour=10`, `minutes=30`, `suffix=a.m.`; the verbalizer then turns this into `ten thirty a m`.\n",
        "The system is designed to be easy to debug and to extend with more rules. We provide both inference for unlabeled data and evaluation for labeled data.\n",
        "\n",
        "We provide a set of rules that covers the majority of semiotic classes found in the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) for the English language. As with every language, there is a long tail of special cases.\n",
        "\n",
        "This tutorial shows how to run prediction on regular text data and how to run evaluation on a labeled text normalization dataset that follows the format of the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).\n"
      ]
    },
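    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the tagger/verbalizer split concrete, the next cell is a minimal sketch of the two steps for the time example above. It is a toy illustration, not the NeMo implementation: `toy_tag_time`, `toy_verbalize_time`, `number_to_words`, and the regular expression are hypothetical names and assumptions introduced here, and they only handle clock times of the form `H:MM a.m.` or `H:MM p.m.`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import re as stdlib_re  # aliased so the notebook's `regex as re` import is not shadowed\n",
        "\n",
        "# Toy lookup tables for spelling out 0-59 (assumption: not taken from NeMo).\n",
        "ONES = [\n",
        "    \"zero\", \"one\", \"two\", \"three\", \"four\", \"five\", \"six\", \"seven\", \"eight\", \"nine\",\n",
        "    \"ten\", \"eleven\", \"twelve\", \"thirteen\", \"fourteen\", \"fifteen\", \"sixteen\",\n",
        "    \"seventeen\", \"eighteen\", \"nineteen\",\n",
        "]\n",
        "TENS = {2: \"twenty\", 3: \"thirty\", 4: \"forty\", 5: \"fifty\"}\n",
        "\n",
        "# Hypothetical pattern: hour, minutes and an a.m./p.m. suffix.\n",
        "TIME_PATTERN = stdlib_re.compile(r\"(?P<hour>\\d{1,2}):(?P<minutes>\\d{2})\\s*(?P<suffix>[ap]\\.m\\.)\")\n",
        "\n",
        "\n",
        "def number_to_words(n):\n",
        "    \"\"\"Spell out an integer between 0 and 59.\"\"\"\n",
        "    if n < 20:\n",
        "        return ONES[n]\n",
        "    tens, ones = divmod(n, 10)\n",
        "    return TENS[tens] if ones == 0 else f\"{TENS[tens]} {ONES[ones]}\"\n",
        "\n",
        "\n",
        "def toy_tag_time(text):\n",
        "    \"\"\"Tagger step: find a time expression and return its fields, or None.\"\"\"\n",
        "    match = TIME_PATTERN.search(text)\n",
        "    if match is None:\n",
        "        return None\n",
        "    return {\"hour\": int(match[\"hour\"]), \"minutes\": int(match[\"minutes\"]), \"suffix\": match[\"suffix\"]}\n",
        "\n",
        "\n",
        "def toy_verbalize_time(tag):\n",
        "    \"\"\"Verbalizer step: turn the tagged fields into their spoken form.\"\"\"\n",
        "    words = [number_to_words(tag[\"hour\"])]\n",
        "    if 0 < tag[\"minutes\"] < 10:\n",
        "        words.append(\"oh\")  # e.g. 9:05 -> nine oh five\n",
        "    if tag[\"minutes\"] > 0:\n",
        "        words.append(number_to_words(tag[\"minutes\"]))\n",
        "    words.append(tag[\"suffix\"].replace(\".\", \" \").strip())\n",
        "    return \" \".join(words)\n",
        "\n",
        "\n",
        "tag = toy_tag_time(\"The alarm goes off at 10:30 a.m.\")\n",
        "print(tag)\n",
        "print(toy_verbalize_time(tag))  # ten thirty a m"
      ],
      "execution_count": null,
      "outputs": []
    },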
    {
      "cell_type": "code",
      "metadata": {
        "id": "UD-OuFmEOX3T"
      },
      "source": [
        "# If you're running the notebook locally, update the NEMO_ROOT path below\n",
        "# In Colab, a few required scripts will be downloaded from NeMo github\n",
        "NEMO_ROOT = '<UPDATE_PATH_TO_NeMo_root>'\n",
        "TOOLS_DIR = NEMO_ROOT + '/tools/text_normalization/'\n",
        "\n",
        "# append NeMo root directory to python path\n",
        "sys.path.append(NEMO_ROOT)\n",
        "\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "    TOOLS_DIR = 'tools/text_normalization/'\n",
        "\n",
        "TOOLS_DATA_DIR = TOOLS_DIR + \"data/\"\n",
        "\n",
        "if 'google.colab' in str(get_ipython()):\n",
        "    os.makedirs(TOOLS_DIR, exist_ok=True)\n",
        "    os.makedirs(TOOLS_DATA_DIR, exist_ok=True)\n",
        "    \n",
        "    required_files = [\n",
        "      'normalize.py',\n",
        "      'tagger.py',\n",
        "      'utils.py',\n",
        "      'run_evaluate.py',\n",
        "      'run_predict.py',\n",
        "      'verbalizer.py',\n",
        "    ]\n",
        "    required_data_file = [             \n",
        "      'currency.tsv',\n",
        "      'magnitudes.tsv',\n",
        "      'measurements.tsv',\n",
        "      'months.tsv',\n",
        "      'whitelist.tsv'\n",
        "    ]\n",
        "    for file in required_files:\n",
        "        if not os.path.exists(os.path.join(TOOLS_DIR, file)):\n",
        "            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DIR + file\n",
        "            print(file_path)\n",
        "            wget.download(file_path, TOOLS_DIR)\n",
        "    for file in required_data_file:\n",
        "        if not os.path.exists(os.path.join(TOOLS_DATA_DIR, file)):\n",
        "            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DATA_DIR + file\n",
        "            print(file_path)\n",
        "            wget.download(file_path, TOOLS_DATA_DIR)\n",
        "elif not os.path.exists(TOOLS_DIR):\n",
        "      raise ValueError(f'update path to NeMo root directory')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S1DZk-inQGTI"
      },
      "source": [
        "`TOOLS_DIR` should now contain the scripts needed in the next steps. All of them can also be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/text_normalization)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1C9DdMfvRFM-"
      },
      "source": [
        "print(TOOLS_DIR)\n",
        "! ls -l $TOOLS_DIR\n",
        "! ls -l $TOOLS_DATA_DIR"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XUEncnqTIzF6"
      },
      "source": [
        "# Data Preparation and Download\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YDVtdJxhZWju"
      },
      "source": [
        "## Data for Prediction\r\n",
        "For prediction, let's download a text file from [http://www.gutenberg.org/files/48874/48874-0.txt](http://www.gutenberg.org/files/48874/48874-0.txt)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bkeKX2I_tIgV"
      },
      "source": [
        "# Create the data directory and download a text file\n",
        "WORK_DIR = 'WORK_DIR'\n",
        "DATA_DIR = WORK_DIR + '/DATA'\n",
        "os.makedirs(DATA_DIR, exist_ok=True)\n",
        "text_file = '48874-0.txt'\n",
        "if not os.path.exists(os.path.join(DATA_DIR, text_file)):\n",
        "    print('Downloading text file')\n",
        "    wget.download('http://www.gutenberg.org/files/48874/' + text_file, DATA_DIR)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yyUE_t4vw2et"
      },
      "source": [
        "`DATA_DIR` should now contain the text file."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VXrTzTyIpzE8"
      },
      "source": [
        "!ls -l $DATA_DIR"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FWqlbSryw_WL"
      },
      "source": [
        "Print the first 10 lines of the file:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1vC2DHawIGt8"
      },
      "source": [
        "! head -n 10 $DATA_DIR/$text_file"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kmDTCuTLH7pm"
      },
      "source": [
        "# Prediction\n",
        "Here we walk through `$TOOLS_DIR/run_predict.py` step by step.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6R7OKAsYH9p0"
      },
      "source": [
        "\r\n",
        "import tools.text_normalization.verbalizer as verbalizer \r\n",
        "import tools.text_normalization.tagger as tagger \r\n",
        "import tools.text_normalization.normalize as normalize\r\n",
        "from tools.text_normalization.run_predict import load_file, write_file"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "O7VpoBMqocQQ"
      },
      "source": [
        "data = load_file(f\"{DATA_DIR}/{text_file}\")\r\n",
        "print(len(data), \"sentences\") "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3etPl6smpYrE"
      },
      "source": [
        "To see how the text was normalized, set the `verbose=True` flag.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "f-AAXW8ZovZo"
      },
      "source": [
        "# normalized_sentences = normalize.normalize_nemo(data, verbose=True)\r\n",
        "normalized_sentences = normalize.normalize_nemo(data, verbose=False)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zGz5KdZjprH_"
      },
      "source": [
        "# Saves output to file\r\n",
        "output_file_path=f\"{DATA_DIR}/{text_file}.normalized\"\r\n",
        "write_file(file_path=output_file_path, data=normalized_sentences)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7RPvYsICqEZZ"
      },
      "source": [
        "# Check that the file was stored correctly\r\n",
        "! ls -l $output_file_path\r\n",
        "! head -n 10 $output_file_path"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RMT5lkPYzZHK"
      },
      "source": [
        "## Data for Evaluation\n",
        "\n",
        "The data for evaluation needs to be segmented and labeled by semiotic class, following the format of the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).\n",
        "That is, every line of the file needs to have the format `<semiotic class>\\t<unnormalized text>\\t<self>` for a trivial class, or `<semiotic class>\\t<unnormalized text>\\t<normalized text>` for a semiotic class.\n",
        "\n",
        "`WHITELIST` is the semiotic class for all whitelisted tokens whose verbalizations are directly looked up from `$TOOLS_DATA_DIR/whitelist.tsv`. To extend the list, simply add further key-value pairs to the file.\n",
        "\n",
        "\n",
        "We will create a simple example file to show how evaluation works:\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "u4zjeVVv-UXR"
      },
      "source": [
        "eval_input_data =  \"\"\"LETTERS\\tA & E\\ta and e\r\n",
        "PUNCT\\t.\\tsil\r\n",
        "PLAIN\\tRetrieved\\t<self>\r\n",
        "DATE\\t2006-08-05\\tthe fifth of august two thousand six\r\n",
        "PUNCT\\t.\\tsil\r\n",
        "<eos>\\t<eos>\r\n",
        "PLAIN\\tDownloaded\\t<self>\r\n",
        "PLAIN\\ton\\t<self>\r\n",
        "DATE\\t7 August 2007\\tthe seventh of august two thousand seven\r\n",
        "PUNCT\\t.\\tsil\"\"\"\r\n",
        "eval_text_file_path = f\"{DATA_DIR}/00001-of-00100\"\r\n",
        "with open(eval_text_file_path, 'w') as fp:\r\n",
        "  fp.write(eval_input_data)\r\n",
        "! cat $eval_text_file_path\r\n"
      ],
      "execution_count": null,
      "outputs": []
    },
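    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The labeled format described above can be sketched with a small parser. This is only an illustration under stated assumptions: `parse_labeled_text` is a hypothetical helper that is not part of NeMo, it treats `<self>` as \"spoken exactly as written\", and it treats an `<eos>` line as the end of a sentence. The evaluation itself relies on the tutorial's own `load_files` from `utils.py`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "def parse_labeled_text(raw_text):\n",
        "    \"\"\"Group labeled lines into sentences of (semiotic_class, written, spoken) tuples.\"\"\"\n",
        "    sentences, current = [], []\n",
        "    for line in raw_text.splitlines():\n",
        "        fields = line.split(\"\\t\")\n",
        "        if fields[0] == \"<eos>\":\n",
        "            # An <eos> line closes the current sentence.\n",
        "            sentences.append(current)\n",
        "            current = []\n",
        "        elif len(fields) == 3:\n",
        "            semiotic_class, written, spoken = fields\n",
        "            # Assumption: <self> means the token is spoken exactly as written.\n",
        "            current.append((semiotic_class, written, written if spoken == \"<self>\" else spoken))\n",
        "    if current:\n",
        "        # Keep a trailing sentence that is not followed by <eos>.\n",
        "        sentences.append(current)\n",
        "    return sentences\n",
        "\n",
        "\n",
        "sample = \"PLAIN\\tRetrieved\\t<self>\\nDATE\\t2006-08-05\\tthe fifth of august two thousand six\\n<eos>\\t<eos>\\nPLAIN\\tDownloaded\\t<self>\"\n",
        "for sentence in parse_labeled_text(sample):\n",
        "    print(sentence)"
      ],
      "execution_count": null,
      "outputs": []
    },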
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bIvKBwRcH_9W"
      },
      "source": [
        "# Evaluation\n",
        "\n",
        "Here we walk through `$TOOLS_DIR/run_evaluate.py` step by step.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "74GLpMgoICmk"
      },
      "source": [
        "\n",
        "import tools.text_normalization.verbalizer as verbalizer \n",
        "import tools.text_normalization.tagger as tagger \n",
        "import tools.text_normalization.normalize as normalize\n",
        "from tools.text_normalization.run_predict import load_file, write_file\n",
        "from tools.text_normalization.utils import (\n",
        "    evaluate,\n",
        "    known_types,\n",
        "    load_files,\n",
        "    training_data_to_sentences,\n",
        "    training_data_to_tokens,\n",
        ")\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7G-n2jYM0HpO"
      },
      "source": [
        "eval_input_data = load_files([eval_text_file_path])\r\n",
        "print(eval_input_data)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nYm8phOe1qa6"
      },
      "source": [
        "print(\"Sentence level evaluation...\")\r\n",
        "sentences_un_normalized, sentences_normalized = training_data_to_sentences(eval_input_data)\r\n",
        "print(\"- Data: \" + str(len(sentences_un_normalized)) + \" sentences\")\r\n",
        "sentences_prediction = normalize.normalize_nemo(sentences_un_normalized)\r\n",
        "print(\"- Normalized. Evaluating...\")\r\n",
        "sentences_accuracy = evaluate(sentences_prediction, sentences_normalized, sentences_un_normalized)\r\n",
        "print(\"- Accuracy: \" + str(sentences_accuracy))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EwNGHovV2FYT"
      },
      "source": [
        "print(\"Token level evaluation...\")\r\n",
        "tokens_per_type = training_data_to_tokens(eval_input_data)\r\n",
        "token_accuracy = {}\r\n",
        "for token_type in tokens_per_type:\r\n",
        "    print(\"- Token type: \" + token_type)\r\n",
        "    tokens_un_normalized, tokens_normalized = tokens_per_type[token_type]\r\n",
        "    print(\"  - Data: \" + str(len(tokens_un_normalized)) + \" tokens\")\r\n",
        "    tokens_prediction = normalize.normalize_nemo(tokens_un_normalized)\r\n",
        "    print(\"  - Normalized. Evaluating...\")\r\n",
        "    token_accuracy[token_type] = evaluate(tokens_prediction, tokens_normalized, tokens_un_normalized)\r\n",
        "    print(\"  - Accuracy: \" + str(token_accuracy[token_type]))\r\n",
        "token_count_per_type = {token_type: len(tokens_per_type[token_type][0]) for token_type in tokens_per_type}\r\n",
        "token_weighted_accuracy = [\r\n",
        "    token_count_per_type[token_type] * accuracy for token_type, accuracy in token_accuracy.items()\r\n",
        "]\r\n",
        "print(\"- Accuracy: \" + str(sum(token_weighted_accuracy) / sum(token_count_per_type.values())))"
      ],
      "execution_count": null,
      "outputs": []
    },
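    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The overall token accuracy printed above is the count-weighted mean of the per-class accuracies, $\\frac{\\sum_t n_t a_t}{\\sum_t n_t}$, where $n_t$ is the number of tokens of class $t$ and $a_t$ is the accuracy on those tokens. The next cell is a small worked example of that arithmetic. The counts and accuracies in `example_results` are made up for illustration and are not results of this system; `weighted_accuracy` is a hypothetical helper."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Hypothetical per-class results: (number of tokens, accuracy on those tokens).\n",
        "example_results = {\"PLAIN\": (3, 1.0), \"DATE\": (2, 0.5), \"PUNCT\": (3, 1.0)}\n",
        "\n",
        "\n",
        "def weighted_accuracy(results):\n",
        "    \"\"\"Return the count-weighted mean of the per-class accuracies.\"\"\"\n",
        "    total_tokens = sum(count for count, _ in results.values())\n",
        "    return sum(count * accuracy for count, accuracy in results.values()) / total_tokens\n",
        "\n",
        "\n",
        "# (3 * 1.0 + 2 * 0.5 + 3 * 1.0) / 8 = 0.875\n",
        "print(weighted_accuracy(example_results))"
      ],
      "execution_count": null,
      "outputs": []
    },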
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lcvT3P2lQ_GS"
      },
      "source": [
        "# Notes\n",
        "\n",
        "The current system expects well-formed sentences and word boundaries. By default, a semiotic token must be surrounded by non-word tokens. For example, `A & E` will be detected as `VERBATIM`, whereas `A&E` will not be detected because there are no spaces around `&`. As an exercise, adjust the word boundary definition in [tools/text_normalization/tagger.py](https://github.com/NVIDIA/NeMo/blob/main/tools/text_normalization/tagger.py) to accommodate this case too."
      ]
    }
  ]
}